Last code execution: 2023 September 05, Tuesday @ 21:09:02.
1 Intro
Using the proteomicslfq nextflow (Di Tommaso et al. 2017) pipeline I processed the thermo raw files I received from the CRG/UPF proteomics facility and calculate the peptide spectral matches (PSMs) against a protein database I build from Mouse UniProt (version 2022_03 3th Aug 2022) using both canonical and alternative isoforms as well as common contaminants.
The pipeline was run as following on the CRG cluster.
The option --protein_quant 'shared_peptides' indicates to quantify proteins based on peptides mapping to single and multiple proteins too, but only mapping peptides greedily for its best group (by inference score). This inference is controlled by --protein_inference 'bayesian' that indicates in order to group proteins, it calculate scores on the protein (group) level and to potentially modify associations from peptides to proteins using bayesian statistics.
The --quantification_method 'spectral_counting' indicated that spectral counting of PSMs is used to quantify the protein intensities.
First I copy the output quantification pipeline out.mzTab files to a local folder.
# Function to extract number of DE proteinsDE_prots <-function(results) {tibble(Dataset =gsub("_results", "", results),Num_Signif_Proteins =get(results) |>filter(significant) |>nrow(),Proteins_Names =get(results) |>filter(significant) |>pull(name) |>paste0(collapse =" / ") )}# Function that wraps around test_diff, add_rejections and get_results functionsDE_analysis <-function(se, signif_thrshld =0.05) { data_diff <-test_diff(se, type ="manual", test ='WT_vs_Dexon4') res <-get_results(add_rejections(data_diff, alpha = signif_thrshld, lfc =1.25))return(res)}# Function to obtain ROC dataget_ROC_df <-function(results) {get(results) |>select(name, WT_vs_Dexon4_p.val, significant) |>mutate(DE =grepl(T, significant),BG =grepl(F, significant)) |>arrange(WT_vs_Dexon4_p.val) -> tmpmutate(tmp,TPR =cumsum(as.numeric(DE)) /length(which(tmp$DE)),FPR =cumsum(as.numeric(BG)) /length(which(tmp$BG)),method = results) -> tmpreturn(tmp)}# Function that wraps around test_diff, add_rejections and get_results functionsDE_analysis_KOrL_KOrS <-function(se, signif_thrshld =0.05) { data_diff <-test_diff(se, type ="manual", test ='KOrL_vs_KOrS') res <-get_results(add_rejections(data_diff, alpha = signif_thrshld, lfc =1.5))return(res)}
2.3 Directories & File Paths
Here I organise all the variables I need to run the analysis and define where to save the processed tables and figures.
Here I analyse 2 datasets. First a Suz12 IP in mESCs of WT and ∆exon4 cell lines, then a Flag IP of Suz12 KO and Suz12 KO rescued with either long or short isoforms, also in mESCs.
3.2.1 Import PSMs from mzTab file
From the proteomicslfq pipeline the important mzTAB file and do some tidying up of the data.
Code
prot_WT_Dex4 <-readMzTabData(file = WT_Dex4_mzTAB_path, what ="PRT", version ="1.0")exprs(prot_WT_Dex4) |>as.data.frame() |>rownames_to_column(var ="info") |>setNames(c("info", "WTp", "WT#1", "WT#2", "dEX4#1", "dEX4#2", "dEX4#3") ) |> tidyr::drop_na() |>mutate(swissprot =str_extract(pattern ="^[s-t][p|r]", string = info)) |># extract the UniProt ID while preserving alternative isoform ID (dashed number)mutate(UniProtID =str_extract(pattern ="(?<=^[s-t][p|r].)[A0-Z9](.*?)\\.([1-9]?)", string = info)) |>mutate(UniProtID =sub(pattern ="\\.", replacement ="-", x = UniProtID)) |>mutate(UniProtID =sub(pattern ="-$", replacement ="", x = UniProtID)) |># Extract species namemutate(Species =str_extract(pattern ="_[A0-Z9](.*?)(?=\\.1)", string = info)) |>mutate(Species =sub(pattern ="^_", replacement ="", x = Species)) |># Not really the protein name, it's an abbreviation used by uniprotmutate(ProteinName =str_extract(pattern ="(?<=^[s-t][p|r]\\.[A0-Z9].....).*?\\.[A0-Z9](.*?)(?=_)",string = info)) |>mutate(ProteinName =str_remove(string = ProteinName, pattern ="(.*?)\\.")) |>mutate(ProteinName =str_remove(string = ProteinName, pattern ="([1-9]?)\\.")) |>mutate(ProteinName =gsub(pattern ="\\.", "-", x = ProteinName )) |>mutate(ProteinName =str_to_title(string = ProteinName) ) -> tidy_prot_wt_dEx4
Here I manually fix 2 protein names. First Jarid2 is called Jard2, this is because it comes from JARD2_MOUSE name used by UniProt. The second is the protein called EED which in this experiment is mapped to the not so well annotated protein with ID A0A5F8MPX8. After some digging into the mzTAB file I believe this is happening because the proteomics engines map the peptide RLGAICDSGGGGGGGGAGSFAAGSGR to A0A5F8MPX8 which cannot be mapped to EED as this seems to be in a longer N-terminus stretch compared to the canonical EED.
Code
tidy_prot_wt_dEx4 |># Fix Jarid2 name that comes from JARD2_MOUSEmutate(ProteinName =sub(pattern ="Jard2", replacement ="Jarid2", x = ProteinName)) |>mutate(ProteinName =ifelse(test = UniProtID =="A0A5F8MPX8", yes ="Eed", no = ProteinName)) |>relocate(swissprot, UniProtID, Species, ProteinName, .after = info) -> tidy_prot_wt_dEx4
Other mis-labelled protein is for example A0A3B2WBM3 which actually corresponds to Elongin-B. Probably the mis-labelling is for the same reason, so I fix it like in the case of Eed adding the PSMs to the other Elob protein with UniProtID P62869.
Code
tidy_prot_wt_dEx4 |># Fix Elongin-Bmutate(ProteinName =ifelse(test = UniProtID =="A0A3B2WBM3", yes ="Elob", no = ProteinName)) |>subset(ProteinName =="Elob") |>mutate(across(.cols =c("WTp", "WT#1", "WT#2", "dEX4#1", "dEX4#2", "dEX4#3"), .fns = sum )) |># Of the 2 IDs keep only the sp onesubset(swissprot =="sp") -> tidy_Elob_sum# remove the Elob individual quantification and add the summed valuestidy_prot_wt_dEx4 |>subset(ProteinName !="Elob") |>rbind(tidy_Elob_sum) -> tidy_prot_wt_dEx4
3.2.2 Calculate a stoichiometry score for PRC2 proteins.
Group proteins by PRC2 core or subgroup and calculate p-values with t.test(). Remove the embryonic Aebp2 isoform (Q9Z248-3) with very very low PSMs from this plot for sake of simplicity.
Code
PRC2_data <-subset(spec_ratio, ProteinName %in% PRC2)PRC2_data <-subset(spec_ratio, UniProtID !="Q9Z248-3")# Subdivide by complexPRC2_data |>mutate(Complex =case_when(ProteinName %in% core ~"PRC2 Core", ProteinName %in% PRC2.1~"PRC2.1", ProteinName %in% PRC2.2~"PRC2.2", )) |>relocate(Complex, .after = Condition) -> PRC2_data# Set decreasing plotting orderPRC2_data |>subset(Condition =="WT") |>arrange(desc(Mean_Ratio)) |>pull(ProteinName) |>unique() -> genes_order# Slight adjustment to have them in a miningful order genes_order <-factor(genes_order, levels = PRC2)PRC2_data$ProteinName <-factor(PRC2_data$ProteinName, levels = PRC2)## calculate significance# Select genes with a mean ∆ stoichiometry ratio <= -0.1PRC2_data |>group_by(ProteinName, Condition, Mean_Ratio) |>summarise() |>ungroup() |>group_by(ProteinName) |>mutate(Delta_Ratio =round(Mean_Ratio - Mean_Ratio[Condition =="WT"], 2)) |>relocate(Delta_Ratio, .after = Condition) |>arrange(Delta_Ratio) |>subset(abs(Delta_Ratio) >=0.1) |>pull(ProteinName) -> genes_to_testPRC2_data |>subset(ProteinName %in% genes_to_test) |>group_by(ProteinName) |>mutate(Pval =t.test(x = Ratio[Condition =="WT"], y = Ratio[Condition =="∆ex4"], exact = F, alternative ="two.sided")$p.value ) |>mutate(Pval =signif(Pval, 2) ) |>relocate(Pval, .after = Condition) |>mutate(Y_Pos =max(Mean_Ratio + Sd_Ratio +0.1)) |>select(ProteinName, Condition, Complex, Pval, Mean_Ratio, Sd_Ratio, Y_Pos) |>unique() |>subset(Pval <=0.05) -> signif_df
Create a summarised Experiment with the PSMs. The function make_se takes the PSMs counts and transforms them in log2 scale. So that values that are 1 become zero and values that were zero PSMs become NA.
There aren’t a lot of missing values but for consistency with the other dataset I impute the missing proteins values anyway.
Code
plot_detect(data_filt)
Impute missing values with different methods and then pick one. The imputation is performed on the dataset where proteins missing too many values were already filtered out (see filter_missval above), and on the vsn normalised data (see normalize_vsn above). For details on imputation algorithms check ?MsCoreUtils::impute_matrix().
Code
# Impute missing data using random draws from a # Gaussian distribution centred around a minimal value (for MNAR: missing not at random )data_imp_min <- DEP::impute(data_norm, fun ="MinProb", q =0.01)
[1] 0.5854005
Code
# Impute missing data using random draws from a # manually defined left-shifted Gaussian distribution (for MNAR: missing not at random )data_imp_man <- DEP::impute(data_norm, fun ="man", shift =1.8, scale =0.3)# Impute missing data using the k-nearest neighbour approach (for MAR: missing at random)data_imp_knn <- DEP::impute(data_norm, fun ="knn", rowmax =0.9)# Impute missing data using the Maximum likelihood-based imputation method using the EM algorithm (for MAR: missing at random)data_imp_mle <- DEP::impute(data_norm, fun ="MLE")
Iterations of EM:
1...2...3...4...5...6...7...
Code
## Add zero to missing datadata_imp_zer <- DEP::impute(data_norm, fun ="zero")
Here I also try to impute missing values in a tailor made fashion, using a mixed model as described in the advanced section of the DEPpackage vignette.
Here I consider a protein to have missing values NOT at random (MNAR) if it has missing values where the sum of the filtered, normalised PSMs across replicates is less than 1. Meaning that if a protein is detected as NA, NA, and 1 across 3 biological replicates of the ∆exon4 samples, the missing values (if any) will be set to the minumum value in the range (i.e. ~ -3).
Code
get_df_long(data_norm) |>group_by(name, condition) |>summarize(Sum_PSMs =sum(intensity, na.rm = T), .groups ='keep' ) |>subset(condition =="Dexon4" ) |>summarise(NAs = Sum_PSMs <=1, .groups ='keep') |>subset(NAs) |>pull(name) |>unique() -> proteins_MNAR# Get a logical vectorMNAR <-names(data_norm) %in% proteins_MNARmessage("Identified: ", length(proteins_MNAR), " proteins missing NOT at random which will be min-imputed")
Identified: 128 proteins missing NOT at random which will be min-imputed
Code
# Perform a mixed imputationdata_imp_mixed <- DEP::impute(data_norm, fun ="mixed",randna =!MNAR, # we have to define MAR which is the opposite of MNARmar ="knn", # imputation function for MARrowmax =0.9,mnar ="min") # imputation function for MNAR
Plot the protein intensity distribution of the original dataset with lots of missing values, the filtered, the normalised one, and the 3 different type of data imputation methods.
Between the mixed imputation and the knn imputation there’s barely any difference. Good, it means that there’re many missing values missing not at random.
Perform a PCA on the mixed imputed samples. As expected the samples major driver of diversity is the Suz12 antibody used (either homemade on the left or the commercial one on the right).
Code
plot_pca(data_imp_mixed, x =1, y =2, n =500, point_size =4) +scale_colour_manual(values =c("WT"="royalblue3", "Dexon4"="goldenrod")) +theme(plot.background =element_blank(), plot.title =element_blank(),panel.background =element_blank())
Following the info in the DEP vignette I test which imputation method is the best.
The methods I tested are:
No imputation, but filtering and normalisation
MinProb imputation
man shifted Gaussian imputation
knn imputation
MLE imputation
zero imputation
Mixture of min for MNAR in Suz12 KO and knn for MAR
For this comparison only I use a significant threshold of 0.05 and a log2 Fold Change higher than |1.25|.
Code
# DE analysis on no, MinProb, man, knn, and MLE imputationsno_imputation_res <-DE_analysis(data_norm)minprob_imputation_res <-DE_analysis(data_imp_min)man_imputation_res <-DE_analysis(data_imp_man)knn_imputation_res <-DE_analysis(data_imp_knn)mle_imputation_res <-DE_analysis(data_imp_mle)zer_imputation_res <-DE_analysis(data_imp_zer)mix_imputation_res <-DE_analysis(data_imp_mixed)
Check how many differential enriched proteins there are in each dataset.
Code
# Number of significant proteinsobjects <-c("no_imputation_res", "minprob_imputation_res", "man_imputation_res", "knn_imputation_res", "mle_imputation_res", "zer_imputation_res", "mix_imputation_res")kable(purrr::map_df(objects, DE_prots))
Basically all methods are good and show no significant improvement over no imputation of missing values. However for consistency with the other dataset I decided to use the mixed imputation method and get the final result table for the volcano plot.
Consider significant everything with an FDR <= 0.05 and a log2 fold change more than 1.5.
Code
# Test every sample versus controldata_diff <-test_diff(data_imp_mixed, type ="control", control ="WT")
Tested contrasts: Dexon4_vs_WT
Code
# Denote significant proteins based on user defined cutoffsdep <-add_rejections(data_diff, alpha =0.05, lfc =1.5 )
Plot sample correlation and see that replicates have very good correlation
Code
plot_cor(dep, significant =TRUE, lower =0, upper =1, pal ="PuOr")
Check the intensity of the significant proteins found in the experiment
Since I notice that Ezh1 and Kmt2b are significantly enriched but have a log2 Fold Change lower than 2 I explore their significance in the light of the fact that they are very low PSMs. To address this I make an MA plot like it’s done for RNA-seq.
As one can see in fact both Ezh1 and Kmt2b (and Esco2) have very low PSMs (below 1). This means that they could be actually differentially bond to Suz12 WT and Suz12 ∆exon4 however their signal is so low it is hard to draw conclusions form it.
Now perform the same analysis but on the other dataset.
4.1.1 Import PSMs from mzTab file for the rescue dataset
Code
prot_Res_KO <-readMzTabData(file = Resce_KO_mzTAB_path, what ="PRT", version ="1.0")# tidy up the dataexprs(prot_Res_KO) |>as.data.frame() |>rownames_to_column(var ="info") |>setNames(c("info", "SKOrL_1", "SKOrL_2", "SKOrL_3","SKOrS_1", "SKOrS_2", "SKOrS_3", "SKO_1", "SKO_2", "SKO_3") ) |> tidyr::drop_na() |>mutate(swissprot =str_extract(pattern ="^[s-t][p|r]", string = info)) |># extract the UniProt ID while preserving alternative isoform ID (dashed number)mutate(UniProtID =str_extract(pattern ="(?<=^[s-t][p|r].)[A0-Z9](.*?)\\.([1-9]?)", string = info)) |>mutate(UniProtID =sub(pattern ="\\.", replacement ="-", x = UniProtID)) |>mutate(UniProtID =sub(pattern ="-$", replacement ="", x = UniProtID)) |># Extract species namemutate(Species =str_extract(pattern ="_[A0-Z9](.*?)(?=\\.1)", string = info)) |>mutate(Species =sub(pattern ="^_", replacement ="", x = Species)) |># Not really the protein name, it's an abbreviation used by uniprotmutate(ProteinName =str_extract(pattern ="(?<=^[s-t][p|r]\\.[A0-Z9].....).*?\\.[A0-Z9](.*?)(?=_)",string = info)) |>mutate(ProteinName =str_remove(string = ProteinName, pattern ="(.*?)\\.")) |>mutate(ProteinName =str_remove(string = ProteinName, pattern ="([1-9]?)\\.")) |>mutate(ProteinName =gsub(pattern ="\\.", "-", x = ProteinName )) |>mutate(ProteinName =str_to_title(string = ProteinName) ) -> tidy_prot_rescue_KO
Here I manually fix 2 protein names like in the previous dataset. First Jarid2 is called Jard2, this is because it comes from JARD2_MOUSE name used by UniProt. The second is the protein called EED which in this experiment is mapped to the not so well annotated protein with ID A0A5F8MPX8.
Code
tidy_prot_rescue_KO |># Fix Jarid2 name that comes from JARD2_MOUSEmutate(ProteinName =sub(pattern ="Jard2", replacement ="Jarid2", x = ProteinName)) |>mutate(ProteinName =ifelse(test = UniProtID =="A0A5F8MPX8", yes ="Eed", no = ProteinName)) |>relocate(swissprot, UniProtID, Species, ProteinName, .after = info) -> tidy_prot_rescue_KO
4.1.2 Calculate a stoichiometry score for PRC2 proteins in the rescue dataset
Group proteins by PRC2 core or subgroup and calculate p-values with t.test(). Remove the embryonic Aebp2 isoform (Q9Z248-3) with very very low PSMs from this plot for sake of simplicity and also an Ezh1 variant (P70351-2).
Now I can use some of the plotting functions of DEP to explore the data.
Code
plot_frequency(data_se)
Test different filtering thresholds. The function filter_missval filters the dataset for proteins that have a maximum of thr missing values in at least one condition. If there are 3 biological replicates thr = 0 means to keep only the proteins that are identified in all 3 replicates and thr = 1 keeps all the proteins that are identified in at least 2 samples of each biological condition. I’ll stick to thr = 0 but I tried less stringent filter with little difference.
Now the number of proteins identified is different
Code
plot_frequency(data_filt)
I can also check the number of proteins identified in each sample. For sure Suz12 KO rescued with Long Suz12 has the highest number of samples. And in general replicates have similar number of identified proteins.
I think vsn normalisation is increasing too much the intensities in the Suz12 KO samples (grey).
4.1.6 Impute missing values
Check the missing values in the dataset.
Code
plot_missval(data_norm)
As expected in MS proteomics experiments there are a lot of missing values, and this missing proteins are generally of low intensity. Here in this dataset, the Suz12 KO samples have most of the missing values and Suz12KO rescued with Short isoform to a certain extent. I believe this is because most of the proteins identified in the Suz12 KO are unspecific background proteins. This makes sense ad they’re used as control and do not pull down Suz12.
Code
plot_detect(data_filt)
Impute missing values with different methods and then pick one. The imputation is performed on the dataset where proteins missing too many values were already filtered out (see filter_missval above), and on the vsn normalised data (see normalize_vsn above). For details on imputation algorithms check ?MsCoreUtils::impute_matrix().
Code
# Impute missing data using random draws from a # Gaussian distribution centred around a minimal value (for MNAR: missing not at random )data_imp_min <- DEP::impute(data_norm, fun ="MinProb", q =0.01)
[1] 1.466768
Code
# Impute missing data using random draws from a # manually defined left-shifted Gaussian distribution (for MNAR: missing not at random )data_imp_man <- DEP::impute(data_norm, fun ="man", shift =1.8, scale =0.3)# Impute missing data using the k-nearest neighbour approach (for MAR: missing at random)data_imp_knn <- DEP::impute(data_norm, fun ="knn", rowmax =0.9)# Impute missing data using the Maximum likelihood-based imputation method using the EM algorithm (for MAR: missing at random)data_imp_mle <- DEP::impute(data_norm, fun ="MLE")
Iterations of EM:
1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...21...22...23...24...25...26...27...28...29...30...31...32...33...34...35...36...37...38...39...40...41...42...43...44...45...46...47...48...49...
Code
## Add zero to missing datadata_imp_zer <- DEP::impute(data_norm, fun ="zero")
Here I also try to impute missing values in a tailor made fashion, using a mixed model as described in the advanced section of the DEPpackage vignette.
Here I consider a protein to have missing values NOT at random (MNAR) if it has missing values in all replicates of Suz12 KO condition or if the sum of the filtered, not-normalised PSMs across replicates is less than 1. Meaning if a protein is detected as NA, NA, and 1 in 3 biological replicates of Suz12 KO, the missing values to be set to the minumum value in the range (i.e. ~ -3).
Code
get_df_long(data_filt) |>group_by(name, condition) |>summarize(Sum_PSMs =sum(intensity, na.rm = T), .groups ='keep' ) |>subset(condition =="KO") |>summarise(NAs = Sum_PSMs <=1, .groups ='keep') |>subset(NAs) |>pull(name) |>unique() -> proteins_MNARmessage("Identified: ", length(proteins_MNAR), " proteins missing NOT at random which will be zero-imputed")
Identified: 665 proteins missing NOT at random which will be zero-imputed
Code
# Get a logical vectorMNAR <-names(data_norm) %in% proteins_MNAR# Perform a mixed imputationdata_imp_mixed <- DEP::impute(data_norm, fun ="mixed",randna =!MNAR, # we have to define MAR which is the opposite of MNARmar ="knn", # imputation function for MARrowmax =0.9,mnar ="min") # imputation function for MNAR
Plot the protein intensity distribution of the original dataset with lots of missing values, the filtered, the normalised one, and the 3 different type of data imputation methods.
Do a dimensionality reduction just to check that the imputation still preserves good sample relationships.
Code
plot_pca(data_imp_mixed, x =1, y =2, n =500, point_size =4) +scale_colour_manual(values =c("KOrS"="darkorange1", "KOrL"="mediumpurple3", "KO"="gray84")) +theme(plot.background =element_blank(), plot.title =element_blank(),panel.background =element_blank())
Following the info in the DEP vignette I test which imputation method is the best.
The methods I tested are:
No imputation, but filtering and normalisation
MinProb imputation
man shifted Gaussian imputation
knn imputation
MLE imputation
zero imputation
Mixture of min for MNAR in Suz12 KO and knn for MAR
For this comparison only I use a significant threshold of 0.05 and |1.5| log2 Fold change.
Code
# DE analysis on no, MinProb, man, knn, and MLE imputationsno_imputation_res <-DE_analysis_KOrL_KOrS(data_norm)minprob_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_min)man_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_man)knn_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_knn)mle_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_mle)zer_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_zer)mix_imputation_res <-DE_analysis_KOrL_KOrS(data_imp_mixed)
Check how many differential enriched proteins there are in each dataset
Code
# Number of significant proteinsobjects <-c("no_imputation_res", "minprob_imputation_res", "man_imputation_res", "knn_imputation_res", "mle_imputation_res", "zer_imputation_res", "mix_imputation_res")kable(purrr::map_df(objects, DE_prots))
Basically all methods are good and better than no imputation. So I settled on using the mixed imputation method and get the final result table for the volcano plot.
4.1.7 Test against Suz12 KO as control
Here the SUZ12 KO acts as a control.
Code
data_imp_mixed |>test_diff(type ="control", control ="KO") |>add_rejections(alpha =0.05) -> res_KO
Volcano plot: KO+SUZ12-Short
Code
plot_volcano(dep = res_KO, contrast ="KOrS_vs_KO") +labs(title ='KO+S vs KO')
Volcano plot: KO+SUZ12-Long
Code
plot_volcano(dep = res_KO, contrast ="KOrL_vs_KO") +labs(title ='KO+L vs KO')
Remove from the dataset the protein hits that are significant in the Suz12 KO pull downs.
Code
get_results(res_KO) |>subset(KOrL_vs_KO_ratio <-0.5) |>subset(KOrL_vs_KO_p.adj <=0.05) |>pull(ID) -> KO_hits_vs_Longget_results(res_KO) |>subset(KOrS_vs_KO_ratio <-0.5) |>subset(KOrS_vs_KO_p.adj <=0.05) |>pull(ID) -> KO_hits_vs_LongKO_hits <-unique(c(KO_hits_vs_Long, KO_hits_vs_Long))message("Found : ", length(KO_hits), " proteins that are significantly enriched in the Suz12 KO. I'll discard them as they're unspecific")
Found : 8 proteins that are significantly enriched in the Suz12 KO. I'll discard them as they're unspecific
4.1.8 Test KO+SUZ12-S vs KO+SUZ12-L
Here the KO+SUZ12-L acts as a control.
Code
data_imp_mixed |>test_diff(type ="control", control ="KOrL") |>add_rejections(alpha =0.05) |>plot_volcano(contrast ="KOrS_vs_KOrL") +labs(title ='KO+S vs KO+L')
Tested contrasts: KOrS_vs_KOrL, KO_vs_KOrL
Consider significant everything with an FDR <= 0.05 and a log2 fold change more than 1.25.
# remove KO hitsdata_results <-subset(data_results, ! ID %in% KO_hits)
4.1.9 KO+S vs KO+L Volcano plot
Rename the 2 isoforms of Aebp2 to match what is used in the literature
Code
data_results |>mutate(name =ifelse(test = name =="Aebp2.1", yes ="Embryo Aebp2", no = name)) |>mutate(name =ifelse(test = name =="Aebp2", yes ="Somatic Aebp2", no = name)) |># actually simplify and label the embryonic Aebp2 just as "Aepb2"mutate(name =ifelse(test = name =="Embryo Aebp2", yes ="Aebp2", no = name)) |>mutate(name =ifelse(test = name =="Jard2", yes ="Jarid2", no = name) ) -> data_results
Select proteins to label and set fill colour for Aebp2 and Jarid2
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
4.3.2 Saint peptide counts
The MzTab file from proteomicslfq doesn’t contain the peptide info per sample, so I use the Saint software (Teo et al. 2014) output.
I created a simple R function to read the peptides of any given UniProtID from the excel output tables using read_saint_peps(). This peptides are then mapped to the protein sequence where I extract start and end position and prepare them for plotting with the function cov2res_peptides(). This functions allow to specify the exon boundary amino acids (with AA_exon_start and AA_exon_end), in this way the function summary_exon_coverage() prints out the number of peptides that map to that exonic region. Lastly the peptides are plotted along the protein sequence using plot_pep_cov().
Please note that there’s no peptide mapping to SUZ12 for sample KO#1 and KO#2 and in the case of KO#3 the peptides are only in exon 1 which is before the SUZ12 exon 2-3 deletion.
Comments on the identification of AS isoforms peptides with Mass spectrometry
Due to MS limitations it is very difficult if not impossible to detect different AS from a label free IP-MS experiment. In the case of SUZ12-S peptide we performed a targeted MS protocol. A Precise Reaction Monitoring (PRM) method was used to acquire the data where we selected only 3 specific SUZ12 peptides for further fragmentation.
I could check other AS exons in PRC2 but for now I want to stop here.
4.4 qPCR
Analyse and plot Suz12 gene expression by qPCR in ESCs.
Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow enables reproducible computational workflows.”Nature Biotechnology 35 (4): 316–19. https://doi.org/10.1038/nbt.3820.
Eng, J. K., T. A. Jahan, and M. R. Hoopmann. 2013. “Comet: an open-source MS/MS sequence database search tool.”Proteomics 13 (1): 22–24.
Huber, Wolfgang, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. 2002. “Variance stabilization applied to microarray data calibration and to the quantification of differential expression.”Bioinformatics 18 (suppl_1): S96–104. https://doi.org/10.1093/bioinformatics/18.suppl_1.S96.
Kim, S., and P. A. Pevzner. 2014. “MS-GF+ makes progress towards a universal database search tool for proteomics.”Nat Commun 5 (October): 5277.
MacLean, Brendan, Daniela M Tomazela, Nicholas Shulman, Matthew Chambers, Gregory L Finney, Barbara Frewen, Randall Kern, David L Tabb, Daniel C Liebler, and Michael J. MacCoss. 2010. “Skyline: an open source document editor for creating and analyzing targeted proteomics experiments.”Bioinformatics 26 (7): 966–68. https://doi.org/10.1093/bioinformatics/btq054.
Teo, Guoci, Guomin Liu, Jianping Zhang, Alexey I. Nesvizhskii, Anne Claude Gingras, and Hyungwon Choi. 2014. “SAINTexpress: Improvements and additional features in Significance Analysis of INTeractome software.”Journal of Proteomics 100 (April): 37–43. https://doi.org/10.1016/j.jprot.2013.10.023.
Zhang, Xiaofei, Arne H. Smits, Gabrielle BA van Tilburg, Huib Ovaa, Wolfgang Huber, and Michiel Vermeulen. 2018. “Proteome-wide identification of ubiquitin interactions using UbIA-MS.”Nature Protocols 13 (3): 530–50. https://doi.org/10.1038/nprot.2017.147.